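The Conductor: The Compiler Driver
Think of the compiler driver (such as GCC) as a grand conductor. It automates the complex process of turning human-readable source code into a binary executable. This journey, the path to execution, begins at compile time and extends through load time and run time.
With separate compilation, the driver processes main.c and sum.c independently. A change to one module does not require re-translating the whole project: only the modified file goes through the preprocessor (cpp), the compiler (cc1), and the assembler (as), and the linker (ld) then combines the resulting relocatable object files.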
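As a rough illustration of separate compilation, the sketch below shows what the two translation units might contain. The text names main.c and sum.c but does not show their contents, so the function bodies, the array, and the name run_main are assumptions (a real main.c would define main instead). Both units appear in one listing only to keep the sketch self-contained. The commands in the comments are the usual gcc invocations for compiling each file on its own and then linking.

```c
#include <assert.h>
#include <stddef.h>

/* ---- sum.c (hypothetical contents) ----
 * Translated on its own with:  gcc -c sum.c   -> sum.o (relocatable object file) */
int sum(const int *a, int n)
{
    assert(a != NULL || n == 0);   /* an empty range is the only case where a may be NULL */
    int s = 0;
    for (int i = 0; i < n; i++)
        s += a[i];
    return s;
}

/* ---- main.c (hypothetical contents) ----
 * Translated on its own with:  gcc -c main.c  -> main.o
 * Only this declaration of sum() is visible here; the linker resolves the call:
 *     gcc -o prog main.o sum.o */
int sum(const int *a, int n);

int array[2] = {1, 2};

/* Stand-in for main(); a real main.c would name this function main. */
int run_main(void)
{
    return sum(array, 2);
}
```

If only sum.c changes, rerunning `gcc -c sum.c` and the link step is enough; main.o is reused as it is.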
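Efficiency and the Memory Hierarchy
Where the linker places data such as grid[0][0] or src[0][0] directly affects throughput and latency. When data is aligned to 32-byte cache lines and accessed with a stride-1 reference pattern, cold misses are minimized and the cache evictions caused by column-wise scans are avoided. In advanced high-performance code, loop unrolling for parallelism (a $4 \times 4$ unrolled loop) further hides the latency of moving data from main memory into the cache, reducing the clock cycles spent waiting (0x32, 0x1, 0x4, 0x51).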
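To make the unrolling idea concrete, here is a minimal sketch that reads an array with stride 1 while unrolling the loop four ways. It reads "$4 \times 4$" as four elements per iteration with four independent accumulators; that reading, and the function name sum_unrolled_4x4, are assumptions rather than something the text specifies.

```c
#include <assert.h>
#include <stddef.h>

/* Stride-1 sum, unrolled 4 ways with 4 independent accumulators.
 * The accumulators do not depend on each other, so the processor can
 * overlap the additions instead of waiting on one long dependency chain. */
long sum_unrolled_4x4(const int *a, size_t n)
{
    assert(a != NULL || n == 0);

    long acc0 = 0, acc1 = 0, acc2 = 0, acc3 = 0;
    size_t i = 0;

    /* Main loop: consecutive elements, so each cache line is used fully. */
    for (; i + 4 <= n; i += 4) {
        acc0 += a[i];
        acc1 += a[i + 1];
        acc2 += a[i + 2];
        acc3 += a[i + 3];
    }

    /* Cleanup loop for the last n % 4 elements. */
    for (; i < n; i++)
        acc0 += a[i];

    return acc0 + acc1 + acc2 + acc3;
}
```

Whether this beats a plain loop depends on the compiler and the machine; an optimizing compiler may already perform a similar transformation.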
QUESTION 1
Which component of the compiler driver is responsible for generating the assembly file (/tmp/main.s)?
The preprocessor (cpp)
The compiler (cc1)
The assembler (as)
The linker (ld)
✅ Correct! cc1 translates the preprocessed C code into assembly code.
❌ Incorrect. The preprocessor only handles macros and headers. cc1 is the stage that produces assembly.
QUESTION 2
What is a primary benefit of 'Separate Compilation'?
It makes the final executable run faster.
It allows modifications to one file without re-translating others.
It automatically unrolls all loops to 4x4.
It eliminates the need for a linker.
✅ Correct! This modularity is essential for large projects to maintain manageable build times.
❌ Incorrect. Separate compilation focuses on translation efficiency, not execution speed directly.
QUESTION 3
How does a Stride-1 reference pattern affect the L1 cache?
It causes column-wise scan evictions.
It maximizes hit rates by utilizing spatial locality.
It bypasses the cache to reduce latency.
It increases the number of cold misses to 100%.
✅ Correct! Accessing memory sequentially ensures that once a 32-byte cache line is loaded, subsequent nearby data is already in the cache.
❌ Incorrect. Stride-1 actually minimizes evictions compared to large strides or column-wise scans.
QUESTION 4
What happens at 0x064C if the linker places a multi-byte integer across a 32-byte cache boundary?
The compiler driver automatically fixes it at run time.
The L1 cache throughput is maximized.
A potential drop in hit rates and increased latency occurs.
The assembler produces a relocatable error.
✅ Correct! Unaligned data or data spanning boundaries requires multiple cache fetches, hurting performance.
❌ Incorrect. This is a performance issue that neither the driver nor the assembler automatically 'fixes' without specific alignment directives.
QUESTION 5
The hex representations 0x32, 0x1, 0x4, and 0x51 in the theory likely represent:
The binary tags for the L2 cache.
Clock frequency stalls or memory fetch latencies.
The sequence of registers used in a 4x4 unroll.
The static library identifiers.
✅ Correct! These illustrate the raw timing variations involved in memory access cycles.
❌ Incorrect. These values are used to describe performance characteristics during advanced optimization.
Case Study: Memory Hierarchy & Hit Rates
Applying Figure 6.48 logic to cache performance
You are analyzing the performance of a program that transposes a matrix using two arrays: src and dst. Both are stored at memory addresses such as 0x064C, 0x064D, 0x064E, and 0x064F. The system uses a 32-byte cache line. Your task is to work out how the placement chosen at link time and the memory access pattern interact.
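The relationship between an address and a 32-byte cache line can be checked with a few lines of arithmetic. The sketch below is illustrative only: the helper names and the 4-byte object size used in the note that follows are assumptions, not part of the case study.

```c
#include <assert.h>
#include <stdint.h>

#define CACHE_LINE_BYTES 32u   /* line size stated in the case study */

/* Byte offset of addr within its cache line. */
unsigned line_offset(uintptr_t addr)
{
    return (unsigned)(addr % CACHE_LINE_BYTES);
}

/* Number of the cache line (block) that contains addr. */
uintptr_t line_number(uintptr_t addr)
{
    return addr / CACHE_LINE_BYTES;
}

/* Nonzero if an object of `size` bytes starting at addr spans two or more lines. */
int crosses_line(uintptr_t addr, unsigned size)
{
    assert(size > 0);
    return line_number(addr) != line_number(addr + size - 1);
}
```

Under these assumptions, a 4-byte integer at 0x064C starts at offset 12 of its line and occupies 0x064C through 0x064F without crossing a boundary, whereas one starting at 0x065E would straddle two lines.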
Q
Based on Figure 6.48, what is the hit rate for the dst and src arrays when the cache line is 32 bytes and the cache is large enough to hold both arrays?
Solution:
Assuming 4-byte integers and a 32-byte cache block, there are 8 integers per cache line (32 / 4 = 8). If the cache is large enough to hold both arrays, conflict and capacity misses are eliminated, leaving only cold misses. Each line therefore misses once, on its first access, and the other 7 accesses to it are hits. This holds whether the array is traversed with stride 1 (row by row) or column by column, because no line is evicted before it is reused. Therefore, the hit rate for both arrays is 7/8, or 87.5%.
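The 7/8 figure can be checked with a small counting model. This is a sketch under stated assumptions: an 8 x 8 array of 4-byte integers that starts on a line boundary, and a cache that never evicts. The dimension and all names are chosen for illustration only.

```c
#include <assert.h>

#define DIM 8                                   /* assumed matrix dimension */
#define INT_BYTES 4                             /* assumed integer size */
#define BLOCK_BYTES 32                          /* line size from the case study */
#define INTS_PER_BLOCK (BLOCK_BYTES / INT_BYTES)
#define BLOCKS_PER_ARRAY ((DIM * DIM) / INTS_PER_BLOCK)

/* Counts hits over DIM*DIM accesses to one array when nothing is ever
 * evicted. column_wise selects the traversal order: 0 visits a[i][j] with
 * j varying fastest (stride 1), 1 visits a[j][i] (stride of one row). */
int count_hits(int column_wise)
{
    unsigned char loaded[BLOCKS_PER_ARRAY] = {0};
    int hits = 0;

    for (int i = 0; i < DIM; i++) {
        for (int j = 0; j < DIM; j++) {
            int elem = column_wise ? j * DIM + i : i * DIM + j;
            int block = elem / INTS_PER_BLOCK;
            assert(block >= 0 && block < BLOCKS_PER_ARRAY);
            if (loaded[block])
                hits++;             /* line already present: hit */
            else
                loaded[block] = 1;  /* first touch: cold miss */
        }
    }
    return hits;
}
```

In this model both traversal orders give 56 hits out of 64 accesses, which is 7/8.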
Q
If the code uses a column-wise scan instead of Stride-1, how does this affect the 'Cache tag' and 'Cache set index' usage?
Solution:
A column-wise scan increases the stride between consecutive accesses from one element to one full row. Consecutive accesses then fall in different cache lines, so each one maps to a different cache set index or finds a mismatched tag in the set it maps to. If the matrix is larger than the cache, lines are evicted before they can be reused, and the hit rate can drop to 0% (all misses).
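The effect on tags and set indices can be seen in a toy direct-mapped cache model. The parameters below are assumptions picked for illustration (a 64 x 64 matrix of 4-byte integers starting at address 0, and 32 sets of 32-byte lines, so the matrix is 16 times larger than the cache); they are not taken from Figure 6.48.

```c
#include <assert.h>

#define MAT_N 64        /* assumed matrix dimension */
#define ELEM_BYTES 4    /* assumed element size */
#define LINE_BYTES 32   /* line size from the case study */
#define NUM_SETS 32     /* assumed: direct-mapped cache, 32 sets = 1 KB */

/* Simulates one full scan of the matrix and returns the number of hits.
 * column_wise == 0: row by row (stride 1); 1: column by column. */
long simulate_hits(int column_wise)
{
    long tags[NUM_SETS] = {0};
    int valid[NUM_SETS] = {0};
    long hits = 0;

    for (int outer = 0; outer < MAT_N; outer++) {
        for (int inner = 0; inner < MAT_N; inner++) {
            int row = column_wise ? inner : outer;
            int col = column_wise ? outer : inner;

            long addr = ((long)row * MAT_N + col) * ELEM_BYTES;
            long line = addr / LINE_BYTES;      /* block number */
            int set = (int)(line % NUM_SETS);   /* cache set index */
            long tag = line / NUM_SETS;         /* cache tag */
            assert(set >= 0 && set < NUM_SETS);

            if (valid[set] && tags[set] == tag) {
                hits++;
            } else {                            /* miss: replace the line in this set */
                valid[set] = 1;
                tags[set] = tag;
            }
        }
    }
    return hits;
}
```

With these parameters the row-wise scan hits on 3584 of 4096 accesses (7/8), while the column-wise scan hits on none, because every line is replaced by another line with a different tag before its next element is read.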